Phenotype prediction using biologically interpretable neural networks on multi-cohort multi-omics data

Integrating multi-omics data into predictive models has the potential to enhance accuracy, which is essential for precision medicine. In this study, we developed interpretable predictive models for multi-omics data by employing neural networks informed by prior biological knowledge, referred to as visible networks. These neural networks offer insights into the decision-making process and can unveil novel perspectives on the underlying biological mechanisms associated with traits and complex diseases. We tested the performance, interpretability and generalizability for inferring smoking status, subject age and LDL levels using genome-wide RNA expression and CpG methylation data from the blood of the BIOS consortium (four population cohorts, Ntotal = 2940). In a cohort-wise cross-validation setting, the consistency of the diagnostic performance and interpretation was assessed. Performance was consistently high for predicting smoking status with an overall mean AUC of 0.95 (95% CI: 0.90–1.00) and interpretation revealed the involvement of well-replicated genes such as AHRR, GPR15 and LRRN3. LDL-level predictions were only generalized in a single cohort with an R2 of 0.07 (95% CI: 0.05–0.08). Age was inferred with a mean error of 5.16 (95% CI: 3.97–6.35) years with the genes COL11A2, AFAP1, OTUD7A, PTPRN2, ADARB2 and CD34 consistently predictive. For both regression tasks, we found that using multi-omics networks improved performance, stability and generalizability compared to interpretable single omic networks. We believe that visible neural networks have great potential for multi-omics analysis; they combine multi-omic data elegantly, are interpretable, and generalize well to data from different cohorts.


L1 regularization
The L1 penalty is implemented using the Tensorflow kernel regularizer method.For each layer with the L1 penalty the following term is added to the loss function.
With r as a hyperparameter balancing the L1 term and with N as the number of weights in that layer.For regression tasks, the binary cross-entropy (BCE) loss was used while for regression tasks the mean squared error (MSE) was utilized in the loss function.A custom L1 regularization function was written for the omic-specific L1 penalty.In this function, only the weights of a specific omic are penalized.
For each layer except for the output layer, a default of r = 1×10 −8 was chosen.The L1 penalty hyperparameter for the output layer was optimized per cohort and task.We added this stronger L1 penalty to improve interpretation for the output layer, since this is a fully connected neuron unguided by prior biological knowledge.

Baseline neural network
To compare the performance of visible neural networks to regular neural networks, we trained a threelayer deep neural network using the same cohort-wise cross-validation setup.A three layer fully connected network (100, 50, 1 nodes) resulted in more than 48 million trainable parameters.This fully connected network did not converge satisfactory and therefore we proceeded with the LocallyConnected layer.The resulting baseline networks consists thus of a LocallyConnected layer with a window size of 20 and stride 10, followed by a fully connected layers with 50 neurons and a fully connected layer with a single output neuron.For the multi-omics data, the methylation and gene expression data were first concatenated before feeding it to the network.We optimized the learning rate and architectural choices (type and number of layers) on the validation set.
The LocallyConnected1D layer has been successfully used in various application in genomics [2][3][4] .This layers can be seen as a hybrid between convolutional and fully-connected layers 5 , using a number of filters with a certain window size that are repeated with a specific stride.Unlike in CNNs, where filters slide over the input, the filters in LocallyConnected are static, akin to those in fully connected networks.

PCA on the activations of the neurons
For each patient all the activation values of the neural network are gathered and concatenated.The activation value is the value obtained after applying the activation function (ReLu, tanh, or sigmoid).The resulting matrix, with dimensions equal to the number of individuals in the training set and the number of neurons in the whole network, is used for principal component analysis.Scikit-learn's PCA function sklearn.decomposition.PCA (v1.4.2) has been used to do the principal component analysis.
A neural network can lean multiple patterns to predict a single outcome.The contribution scores based on the weights provide a global interpretation that fails to show this individual difference.The weights represent an average over the whole population.Individually, the neural network could use different patterns.A PCA on the activation can show the different activation patterns for the nodes, which can be leveraged to find groups of individuals that have a different underlying mechanism that still leads to the same outcome.

Tables 1. Performance of regularized linear and logistic regression Validation set Rotterdam Study
Supplementary Table 1.Performance of out-of-the-box scikit-learn implementations.LogisticRegression with an L1 penalty forpredicting smoking status and LASSO for LDL and Age prediction.2.

1.09 (0.86 -1.32) SupplementaryTable 2 .
Baseline, three-layer deep neural network (locallyconnected1D layer followed by two dense layers), performance for cohort-wise cross validation, mean with 95% confidence interval over 10 runs.The area under the curve is reported for the classification task (smoking status prediction) and the root mean squared error (RMSE) for the regression tasks, predicting age and LDL levels.ME; Methylation, GE; Gene expression, ME+GE, both methylation and gene expression as an input for the neural network.RS; Rotterdam study, LL; LifeLines, NTR; Netherlands Twin Register, LLS; Leiden Longevity Study.Supplementary Table3.Hyperparameters for the best performance in the validation set.Batch size was in all experiments 64, L1 penalty value (L1) was on the kernel regularization penalty, learning rate (LR) was for the ADAM optimizer.

Table 4 .
Results for the base network and variations to the base network.Area under the curve for smoking status prediction for each cohort in a cohort-wise cross validation, mean with 95% confidence interval over 10 runs.

Table 5 :
Genes with contributions higher than 1% of the total weight for smoking prediction for three out of the

.63 -0.91) SupplementaryTable 7 .
Results for the base network and variations to the base network.Age prediction performance in explained variance (R 2 ) for each cohort in a cohort-wise cross validation, mean over 10 runs with 95% confidence interval.